An Exploratory Data and Network Analysis of Movies

Introduction

In this report we analyse a dataset from Kaggle containing movies of many genres produced over a long span of years. What makes this analysis interesting is that we can draw conclusions based on a movie’s popularity, its directors or actors, its year of production, and so forth. Moreover, we can construct various networks in an attempt to find meaningful and interesting results. When inspecting a database of films from recent years, various interesting inferences emerge: a film may have a high rating yet a low return on investment (ROI). Which genre would you guess is the most successful? Which actors do you think are the most popular?

We have split our Exploratory Data Analysis into four main parts:

Section
1 Introducing the Data
- We first try to understand the data and look at its content.
2 Pre-Processing
- We look at what needs to be altered or removed from the dataset.
- We try to clean any dirty text.
- We try to minimise the dataset’s missing values.
3 Exploring the Data
- We conduct basic analysis on the dataset.
- We explore genres.
- We explore movie popularity.
- We look at profit, gross, and return on investment for movies.
- We conduct more advanced analysis on the dataset.
4 Network Analysis
- We measure the network (centrality, degree distribution, number of components, average degree).
- We use network measures to highlight certain nodes (actors) and see which measures of an actor are associated with higher ratings and budgets.

Admin

Before we start, let’s keep this code chunk for importing the correct libraries and loading the appropriate dataset. We use pacman to load the following:

pacman::p_load(rjson, jsonlite, DT,  RJSONIO, data.table, dplyr, compareDF, prettydoc, rmdformats, VIM, ggplot2, stringr, tidyr, plotly, RColorBrewer, formattable, corrplot, ggpubr, ngram, syuzhet, tm, wordcloud, sentimentr, reshape2, rlist, gplots, plsgenomics, ggrepel, GGally)

We import the dataset like this:

movie_metadata <- read.csv("../data/movie_metadata.csv", sep=";")

In the next section we introduce our dataset and look at its content.


Introducing The Dataset

This section of the report is essential for our analysis: we cannot make any interesting inferences from the dataset if we do not know what is contained within it. Here we establish exactly what we are dealing with, after which we can begin to draw interesting results. We have already read in our dataset, movie_metadata, so we can see the following:

The dataset contains 28 unique columns/variables, each of which is described in the table below:

Variable Name Description
color Specifies whether a movie is in black and white or color
director_name Contains name of the director of a movie
num_critic_for_reviews Contains number of critic reviews per movie
duration Contains duration of a movie in minutes
director_facebook_likes Contains number of facebook likes for a director
actor_3_facebook_likes Contains number of facebook likes for actor 3
actor_2_name Contains name of 2nd leading actor of a movie
actor_1_facebook_likes Contains number of facebook likes for actor 1
gross Contains the amount a movie grossed in USD
genres Contains the sub-genres to which a movie belongs
actor_1_name Contains name of the actor in lead role
movie_title Title of the Movie
num_voted_users Contains number of users votes for a movie
cast_total_facebook_likes Contains number of facebook likes for the entire cast of a movie
actor_3_name Contains the name of the 3rd leading actor of a movie
facenumber_in_poster Contains number of actors faces on a movie poster
plot_keywords Contains key plot words associated with a movie
movie_imdb_link Contains the link to the imdb movie page
num_user_for_reviews Contains the number of user generated reviews per movie
language Contains the language of a movie
country Contains the name of the country in which a movie was made
content_rating Contains maturity rating of a movie
budget Contains the amount of money spent in production per movie
title_year Contains the year in which a film was released
actor_2_facebook_likes Contains number of facebook likes for actor 2
imdb_score Contains user generated rating per movie
aspect_ratio Contains the size of the aspect ratio of a movie
movie_facebook_likes Number of likes of the movie on its Facebook Page

Furthermore, the dataset contains 5043 movies, spanning across 96 years and 46 countries. There are 1693 unique directors and 5390 unique actors/actresses. Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries.
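These summary figures can be reproduced with a few one-liners. Below is a minimal sketch on a toy stand-in for movie_metadata (the same calls apply unchanged to the full data frame):

```r
# Toy stand-in for movie_metadata (illustrative only)
mm <- data.frame(
  movie_title   = c("A", "B", "C", "D"),
  director_name = c("X", "X", "Y", "Z"),
  country       = c("USA", "USA", "UK", "France"),
  title_year    = c(1990, 2000, 2010, 2016)
)

nrow(mm)                                    # number of movies
diff(range(mm$title_year)) + 1              # span of years covered
length(unique(mm$director_name))            # unique directors
round(100 * prop.table(table(mm$country)))  # percentage of movies per country
```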

The structure of the dataset can also be used to understand our data. We can run the following code chunk to see its structure.

# Get structure of dataset
str(movie_metadata)
## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
##  $ actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : chr  "English" "English" "English" "English" ...
##  $ country                  : chr  "USA" "USA" "UK" "USA" ...
##  $ content_rating           : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

In the next section we start preparing the dataset for analysis by removing or simplifying some of the data.


Pre-Processing Data

In this part of the report we look for anything that may have a negative or insignificant impact on the inferences we draw from the dataset. Once the dataset is sufficiently cleaned and prepared, we can commence drawing conclusions from the graphs we generate.

Duplicate Rows

In movie_metadata, we have some duplicate rows, so we want to remove the 45 duplicated rows and keep the unique ones.

# find duplicated rows
sum(duplicated(movie_metadata))
## [1] 45
# Remove duplicated rows
movie_metadata <- movie_metadata[!duplicated(movie_metadata), ]

Missing Values

Let’s have a look at the number of NA values in our dataset:

# Find NA values
colSums(sapply(movie_metadata, is.na))
##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        49                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       103                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross                    genres 
##                       874                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         1                         1 
##              actor_3_name      facenumber_in_poster 
##                         0                        13 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                        22                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                       488                       108 
##    actor_2_facebook_likes                imdb_score 
##                        14                         1 
##              aspect_ratio      movie_facebook_likes 
##                       328                         1

To help visualise this, have a look at the following heatmap of the missing values:

# Visualise Missing Values
missing.values <- aggr(movie_metadata, sortVars = T, prop = T, sortCombs = T, cex.lab = 1.5, cex.axis = .6, cex.numbers = 5, combined = F, gap = -.2)

## 
##  Variables sorted by number of missings: 
##                   Variable       Count
##                      gross 0.174869948
##                     budget 0.097639056
##               aspect_ratio 0.065626251
##                 title_year 0.021608643
##    director_facebook_likes 0.020608243
##     num_critic_for_reviews 0.009803922
##     actor_3_facebook_likes 0.004601841
##       num_user_for_reviews 0.004401761
##                   duration 0.003001200
##     actor_2_facebook_likes 0.002801120
##       facenumber_in_poster 0.002601040
##     actor_1_facebook_likes 0.001400560
##            num_voted_users 0.000200080
##  cast_total_facebook_likes 0.000200080
##                 imdb_score 0.000200080
##       movie_facebook_likes 0.000200080
##                      color 0.000000000
##              director_name 0.000000000
##               actor_2_name 0.000000000
##                     genres 0.000000000
##               actor_1_name 0.000000000
##                movie_title 0.000000000
##               actor_3_name 0.000000000
##              plot_keywords 0.000000000
##            movie_imdb_link 0.000000000
##                   language 0.000000000
##                    country 0.000000000
##             content_rating 0.000000000

Gross and Budget

Since gross and budget have many missing values (874 and 488 respectively), and we want to keep both variables for the following analysis, our only option is to delete the rows with missing gross or budget values; imputation would not do a good job here.

# Find NA values for gross and budget
movie_metadata <- movie_metadata[!is.na(movie_metadata$gross), ]
movie_metadata <- movie_metadata[!is.na(movie_metadata$budget), ]
dim(movie_metadata)
## [1] 3857   28

The number of observations has decreased by 4998 - 3857 = 1141, which is fortunately only 22.8% of the previous total. Let’s also have a look at how many complete cases we have.
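The complete-case count mentioned above can be checked directly with complete.cases(). A minimal sketch on a toy stand-in for movie_metadata:

```r
# Toy stand-in with some rows containing NAs (illustrative only)
mm <- data.frame(gross  = c(100, 200, NA),
                 budget = c(50, NA, 30))

# Rows with no missing values at all
sum(complete.cases(mm))   # 1
# As a proportion of all rows
mean(complete.cases(mm))
```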

Content Rating

# Look at all the different types of content ratings
table(movie_metadata$content_rating)
## 
##            Approved         G        GP         M     NC-17 Not Rated 
##        51        17        91         1         2         6        42 
##    Passed        PG     PG-13         R   Unrated         X 
##         3       573      1314      1723        24        10

From the history of these content ratings, M and GP were precursors of PG, and X was renamed NC-17. We therefore replace M and GP with PG, and X with NC-17, since those are the ratings used nowadays.

movie_metadata$content_rating[movie_metadata$content_rating == 'M']   <- 'PG' 
movie_metadata$content_rating[movie_metadata$content_rating == 'GP']  <- 'PG' 
movie_metadata$content_rating[movie_metadata$content_rating == 'X']   <- 'NC-17'

We replace Approved, Not Rated, Passed, and Unrated with the most common rating, R.

movie_metadata$content_rating[movie_metadata$content_rating == 'Approved']  <- 'R' 
movie_metadata$content_rating[movie_metadata$content_rating == 'Not Rated'] <- 'R' 
movie_metadata$content_rating[movie_metadata$content_rating == 'Passed']    <- 'R' 
movie_metadata$content_rating[movie_metadata$content_rating == 'Unrated']   <- 'R' 
movie_metadata$content_rating <- factor(movie_metadata$content_rating)
table(movie_metadata$content_rating)
## 
##           G NC-17    PG PG-13     R 
##    51    91    16   576  1314  1809

Blank entries should be treated as missing values. Since they cannot be replaced with reasonable data, we delete these rows.

# Remove rows with blank content ratings
movie_metadata <- movie_metadata[!(movie_metadata$content_rating %in% ""),]

Delete (Some) Rows

colSums(sapply(movie_metadata, is.na))
##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                         1                         0 
##   director_facebook_likes    actor_3_facebook_likes 
##                         0                         6 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         1 
##                     gross                    genres 
##                         0                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                         0                         6 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                         0                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                         0                         0 
##    actor_2_facebook_likes                imdb_score 
##                         2                         0 
##              aspect_ratio      movie_facebook_likes 
##                        55                         0

We remove aspect_ratio because (1) it has a lot of missing values and (2) we will not be looking into its impact on the other data (we assume it has none).

# Remove aspect_ratio column
movie_metadata <- subset(movie_metadata, select = -c(aspect_ratio))

Add a Column

Gross and Budget

We have gross and budget information, so let’s add two columns for further analysis: profit and percentage return on investment.

# add profit and return of investment column
movie_metadata <- movie_metadata %>% 
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100)

Remove (Some) Columns

Colour

Next, we take a look at the influence of colour vs black and white.

# Get colour display types of movies
table(movie_metadata$color)
## 
##                   Black and White            Color 
##                2              124             3680

Since only around 3.4% of the movies are in black and white, the variable is nearly constant, so we remove the color column.

# delete colour
movie_metadata <- subset(movie_metadata, select = -c(color))

Language

Let’s have a look at the different languages contained within the dataset.

# Look at different languages
table(movie_metadata$language)
## 
##            Aboriginal     Arabic    Aramaic    Bosnian  Cantonese 
##          2          2          1          1          1          7 
##      Czech     Danish       Dari      Dutch    English   Filipino 
##          1          3          2          3       3644          1 
##     French     German     Hebrew      Hindi  Hungarian Indonesian 
##         34         11          2          5          1          2 
##    Italian   Japanese     Kazakh     Korean   Mandarin       Maya 
##          7         10          1          5         14          1 
##  Mongolian       None  Norwegian    Persian Portuguese   Romanian 
##          1          1          4          3          5          1 
##    Russian    Spanish       Thai Vietnamese       Zulu 
##          1         24          3          1          1

Almost 95% of the movies are in English, which means this variable is nearly constant. Let’s remove it.
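The removal mirrors the aspect_ratio deletion earlier. A minimal sketch on a toy stand-in for movie_metadata (the same subset() call applies to the full data frame):

```r
# Toy stand-in for movie_metadata (illustrative only)
mm <- data.frame(language   = c("English", "English", "French"),
                 imdb_score = c(7.1, 6.5, 7.9))

# Drop the nearly constant language column
mm <- subset(mm, select = -c(language))
names(mm)
```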

Country

Next, we can look at the different types of countries.

table(movie_metadata$country)
## 
##    Afghanistan      Argentina          Aruba      Australia        Belgium 
##              1              3              1             40              1 
##         Brazil         Canada          Chile          China       Colombia 
##              5             63              1             13              1 
## Czech Republic        Denmark        Finland         France        Georgia 
##              3              9              1            103              1 
##        Germany         Greece      Hong Kong        Hungary        Iceland 
##             79              1             13              2              1 
##          India      Indonesia           Iran        Ireland         Israel 
##              5              1              4              7              2 
##          Italy          Japan         Mexico    Netherlands       New Line 
##             11             15             10              3              1 
##    New Zealand         Norway  Official site           Peru    Philippines 
##             11              4              1              1              1 
##         Poland        Romania         Russia   South Africa    South Korea 
##              1              2              3              3              8 
##          Spain         Taiwan       Thailand             UK            USA 
##             22              2              4            316           3025 
##   West Germany 
##              1

Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries, so we group the remaining countries together to give this categorical variable fewer levels: USA, UK, and Others.

# Collapse all countries other than USA and UK into "Others"
movie_metadata$country <- as.character(movie_metadata$country)
movie_metadata$country[!(movie_metadata$country %in% c('USA', 'UK'))] <- 'Others'
movie_metadata$country <- factor(movie_metadata$country)
table(movie_metadata$country)
## 
## Others     UK    USA 
##    465    316   3025

Now that we’ve cleaned up our dataset, we can continue to explore the data even further! In the next section we look at genres, movie popularity, gross, profit, and many other aspects pertinent to our data.


Analysing Data

When inspecting a dataset of movies over the past few years, various interesting inferences can be uncovered. A movie may have a high rating yet low return on investment. Which genre is the most successful? Which actors are the most popular? These are some of the questions we aim to answer in this section.

We start by performing basic analysis on our data. Thereafter, we delve deeper into more specific parts of the dataset, in the hope of uncovering interesting observations.

Basic Analysis

Let’s first have a look at the number of movies that are produced over the years.

ggplot(movie_metadata, aes(title_year)) +
  geom_bar() +
  labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1916 - 2016)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_vline(xintercept=c(1980), linetype="dotted")

From the graph we see there aren’t many records of movies released before 1980. It is better to remove those records because they may not be representative.

movie_metadata <- movie_metadata[movie_metadata$title_year >= 1980,]

Let’s have a look at the movie counts now:

ggplot(movie_metadata, aes(title_year)) +
  geom_bar() +
  labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1980 - 2016)") +
  theme(plot.title = element_text(hjust = 0.5))

Movie Genre Analysis

Now we can delve into more specific things regarding movies, like genres.

Top Genres

genre = movie_metadata['genres']
genre = data.frame(table(genre))
genre = genre[order(genre$Freq,decreasing=TRUE),]

# Top 20 genres with the most movies
ggplot(genre[1:20,], aes(x=reorder(factor(genre), Freq), y=Freq, alpha=Freq)) + 
  geom_bar(stat = "identity", fill="blue") + 
  geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
  xlab("Genre") + 
  ylab("Number of Movies") + 
  ggtitle("Top 20 genres with the most movies") + 
  coord_flip()

Split Genres

As you can see, each movie is associated with multiple genres. For analysis purposes, we choose to use the first genre listed in the genres column, as this is likely the most accurate description.

head(movie_metadata$genres)
## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"       
## [3] "Action|Adventure|Thriller"       "Action|Thriller"                
## [5] "Action|Adventure|Sci-Fi"         "Action|Adventure|Romance"

Let’s split the genre strings, separated by “|”, into 8 different columns.

# Split on "|"
genres_split <- str_split(movie_metadata$genres, pattern="[|]", n=2)

# Create Matrix
genres_matrix <- do.call(rbind, strsplit(movie_metadata$genres, '[|]'))

# Dataframe of genres
genres_df <- as.data.frame(genres_matrix)

genres_df consists of 8 columns, each containing genres. Let’s have a look at the frequency of all the genres.

# Collapse all genres into one column
genres_one_col <- gather(genres_df) %>% 
  select(value)

# Plot frequency of first column
genres_one_col %>%
  group_by(value) %>% 
  tally() %>% 
  filter(n >= 30) %>% 
  ggplot() +
  geom_bar(aes(x = value, y=n), stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Genre") +
  ylab("Frequency") +
  NULL

It is evident that Drama and Comedy are the most frequently produced genres. However, this does not mean that they are the most profitable or deliver the best ROI. We explore this further below.

Previously we assumed that the first genre is the most applicable; therefore, we choose the first column as the movie’s genre and append it to the data frame.

# Take old genres column away
movie_metadata <- subset(movie_metadata, select = -c(genres))

# Take first column of genres_df and add it to MAIN df
movie_metadata$genre <- genres_df$V1

What does this distribution look like over the years? Let’s have a look at the frequency of genres between 1980 and 2016.

movie_metadata %>%
 group_by(title_year, genre) %>%
 summarise(count = n()) %>%
 ggplot(aes(title_year, as.factor(genre))) +
 geom_tile(aes(fill=count),colour="white") +
 scale_fill_gradient(low="light blue",high = "dark blue") +
 xlab("Year of Movie") +
 ylab("Genre of Movie") +
 ggtitle("Heat Map of Movie Genres Produced Over the Years") +
  theme(panel.background = element_blank())

The heat map shows that the high-frequency genres are produced more and more often over the years, as evidenced by the darker shades of blue becoming more prominent in the later years.

Popularity Analysis

IMDB ratings VS Movie Count

ggplot(movie_metadata, aes(imdb_score)) +
 geom_histogram(bins = 50) +
 geom_vline(xintercept = mean(movie_metadata$imdb_score,na.rm = TRUE),colour = "blue") +
 ylab("Movie Count") +
 xlab("IMDB Rating") +
 ggtitle("IMDB Ratings for Movies") +
  ggplot2::annotate("text", label = "Mean IMDB rating",x = 6.2, y = 50, size = 3, colour = "yellow",angle=90)

It is evident from the histogram that the majority of movies are rated between 6 and 8 out of 10. Ratings well below this range are relatively rare, as are the truly outstanding movies rated above it.
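Rather than eyeballing the histogram, the 6-to-8 band can be quantified directly. A sketch on toy scores standing in for imdb_score (the same expression applies to movie_metadata$imdb_score):

```r
# Toy IMDB scores standing in for movie_metadata$imdb_score (illustrative only)
scores <- c(4.1, 6.2, 6.8, 7.3, 7.9, 8.4, 9.0, 7.0)

# Proportion of movies rated between 6 and 8 (inclusive)
mean(scores >= 6 & scores <= 8, na.rm = TRUE)
```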

Popularity over the years

#Creating the required subset of data 
movies_pop <- movie_metadata %>%
 select(title_year, movie_facebook_likes) %>%
 filter(title_year > 1980) %>%
 group_by(title_year) %>%
 summarise(avg = mean(movie_facebook_likes)) 

#Generating the popularity Vs time plot
 ggplot(movies_pop, aes(x = title_year, y = avg)) +
   geom_point() +
   geom_smooth() + 
   geom_vline(xintercept = c(1990,2004),colour = c("orange","blue")) +
   ylab("Mean Popularity Score") +
   xlab("Years") +
   ggplot2::annotate("text", label = "Facebook",x = 2003, y = 80, size = 3, colour = "blue",angle=90)+
   ggplot2::annotate("text", label = "IMDB",x = 1989, y = 80, size = 3, colour = "orange",angle=90)


A spike in popularity around 2004 is obvious from the clear rise in the popularity score. The creation of Facebook and the growing influence of social media contributed to this.

Facebook Likes VS IMDB Score

movie_metadata %>%
  plot_ly(x = ~movie_facebook_likes, y = ~imdb_score, color = ~content_rating , mode = "markers", text = ~content_rating, alpha = 0.7, type = "scatter")

It is evident that movies with higher ratings tend to have more likes on Facebook. This is probably because a good critic rating increases people’s desire to see, and find out more about, the movie.
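This relationship can also be checked numerically. A Spearman rank correlation is a reasonable choice given how heavily skewed like counts are; below is a sketch on toy values, with the same cor() call applying to movie_facebook_likes and imdb_score on the full data:

```r
# Toy values standing in for movie_facebook_likes and imdb_score (illustrative only)
likes  <- c(0, 1200, 33000, 85000, 164000)
scores <- c(5.9, 6.4, 7.1, 6.8, 8.5)

# Spearman correlation handles the skewed like counts better than Pearson
cor(likes, scores, method = "spearman", use = "complete.obs")
```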

Top 20 directors with highest average IMDB score

When examining the directors with the highest average IMDB scores, the highest average is 8.6. This suggests that individual movies rated above this level are either exceptional or one-off occurrences.

movie_metadata %>%
  group_by(director_name) %>%
  summarise(avg_imdb = mean(imdb_score)) %>%
  arrange(desc(avg_imdb)) %>%
  top_n(20, avg_imdb) %>%
  formattable(list(avg_imdb = color_bar("orange")), align = 'l')
director_name avg_imdb
Tony Kaye 8.600000
Damien Chazelle 8.500000
Majid Majidi 8.500000
Ron Fricke 8.500000
Christopher Nolan 8.425000
Asghar Farhadi 8.400000
Marius A. Markevicius 8.400000
Richard Marquand 8.400000
Sergio Leone 8.400000
Lee Unkrich 8.300000
Lenny Abrahamson 8.300000
Pete Docter 8.233333
Hayao Miyazaki 8.225000
Joshua Oppenheimer 8.200000
Juan José Campanella 8.200000
Quentin Tarantino 8.200000
David Sington 8.100000
Je-kyu Kang 8.100000
Terry George 8.100000
Tim Miller 8.100000

The IMDB rating system started in the 1990s, while social media platforms like Facebook appeared in the mid-2000s. Facebook appears to have influenced the number of votes per movie far more than the launch of IMDB did.

#Performing operations on Movies Vote Count over the years
 movies_vote1 <- movie_metadata %>%
  select(title_year, num_voted_users) %>%
  group_by(title_year) %>%
  summarise(count = sum(num_voted_users)) 

 ggplot(movies_vote1, aes(x = title_year, y = count/1000)) +
   geom_bar( stat = "identity") +
   geom_vline(xintercept = c(1990,2004),colour = c("orange","blue")) +
   ylab("Vote count (in thousands)") +
   xlab("Years") +
   ggplot2::annotate("text", label = "Facebook",x = 2003, y = 160, size = 3, colour = "blue",angle=90) + 
   ggplot2::annotate("text", label = "IMDB",x = 1989, y = 160, size = 3, colour = "orange",angle=90)

Vote Counts VS IMDB score

# breaking num_users_voted into 4 buckets 
movie_metadata$vote_bucket <- cut(movie_metadata$num_voted_users, 
                         c(0, 50000, 100000, 300000, 500000))
# plotting a boxplot for votes vs imdb_score  
bp <- na.omit(movie_metadata) %>% 
  ggplot(aes(x = vote_bucket, y = imdb_score)) +
  geom_boxplot(aes(fill = vote_bucket), alpha = 0.7,
               show.legend = F) +
  stat_summary(fun.y = mean, geom = "point",
               shape = 23) +
  coord_flip() +
  xlab("User Votes") +
  ylab("IMDB Score") +
  ggtitle("Box plot for User Votes vs IMDB Score") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, 
                                  color = "darkgrey",
                                  family = "Times"))
# Margin and axis settings for the plotly layout
# (assumed values; presumably defined in an earlier chunk)
m <- list(l = 80, r = 20, b = 60, t = 40)
a <- list(showgrid = TRUE)
ggplotly(bp) %>%
  layout(margin = m, 
        xaxis = a, 
        yaxis = a,
        legend = list(orientation = "h", size = 4,
                       bgcolor = "#E2E2E2",
                       bordercolor = "darkgrey",
                       borderwidth = 1,
                       x = 0,
                       y = -0.3)) 
# scatter plot for user votes vs imdb score
# (scatter_plot() is a custom plotting helper defined elsewhere in the project)
scatter_plot(movie_metadata$num_voted_users, movie_metadata$imdb_score, 
     "User Votes",
     "IMDB Score",
     "Scatter plot for User Votes vs IMDB Score",
     "User Votes:",
     "<br>IMDB Score:",
     "<br>Title:",
     alpha = 0.3)

From the above scatter plot, it is evident that the majority of movie ratings cluster around the 7.5 mark.

Each line below shows a yearly average: IMDB score, votes, and user reviews.

# creating a data frame for average score, avg votes and, avg user reviews
# by title year
scat_year_score <- movie_metadata %>%
  group_by(title_year) %>%
  summarise(count = n(),
            IMDB_Score = round(mean(imdb_score),1),
            avg_votes = mean(num_voted_users),
            avg_user_review = round(mean(num_user_for_reviews)))

# plotting line graph for Avg score by title year 
# (line_graph() is a custom plotting helper defined elsewhere in the project)
lp1 <- line_graph(scat_year_score, 
                  scat_year_score$IMDB_Score, 
                  "Average Score")

# plotting line graph for Avg votes by title year
lp2 <- line_graph(scat_year_score, 
                  scat_year_score$avg_votes, 
                  "Average Votes")

# plotting line graph for Avg reviews by title year
lp3 <- line_graph(scat_year_score,
                  scat_year_score$avg_user_review,
                  "Average User Review") 


subplot(lp1, lp2, lp3, nrows = 3, heights = c(0.33, 0.33, 0.33))

Profit | Gross | Return on Investment

budget <- movie_metadata %>%
  select(movie_title, budget) %>%
  arrange(desc(budget)) %>%
  head(15)

x <- ggplot(budget, aes(x = reorder(movie_title, -desc(budget)), y = budget/1000000)) +
  geom_bar( stat = "identity")+ 
  theme(axis.text.x=element_text(hjust=1))+
  ggtitle("Movie Budgets")+
  xlab("")+
  ylab("Budget (in Millions)") + 
  coord_flip()

rev <- movie_metadata %>%
  select(movie_title, gross) %>%
  arrange(desc(gross)) %>%
  head(15)

y <- ggplot(rev, aes(x = (reorder(movie_title, -desc(gross))), y = gross/1000000)) +
  geom_bar( stat = "identity")+ 
  theme(axis.text.x=element_text(hjust=1))+
  ggtitle("Movie Gross")+
  xlab("")+
  ylab("Gross (in Millions)") + 
  coord_flip() 

ggarrange(x, y,
          labels = c("A", "B"),
          ncol = 1, nrow = 2)

The above graph makes it evident that the movies with the highest budgets do not necessarily achieve the highest gross.
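We can put a rough number on how loosely budget tracks gross. A sketch on toy figures (the same cor() call applies to the budget and gross columns of the full data frame):

```r
# Toy budget and gross figures in dollars (illustrative only)
budget <- c(237e6, 300e6, 245e6, 250e6, 4e5)
gross  <- c(761e6, 309e6, 200e6, 448e6, 1e8)

# A positive but far-from-perfect correlation: big budgets help,
# but they do not guarantee a big gross
cor(budget, gross, use = "complete.obs")
```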

Most Successful Directors based on Profit

#Top 20 most successful directors
movie_metadata %>%
  group_by(director_name) %>%
  mutate(profit = gross - budget) %>%
  select(director_name, budget, gross, profit) %>%
  na.omit() %>% 
  summarise(films = n(), budget = sum(as.numeric(budget)), gross = sum(as.numeric(gross)), profit = sum(as.numeric(profit))) %>%
  mutate(avg_per_film = profit/films) %>%
  arrange(desc(avg_per_film)) %>% 
  top_n(20, avg_per_film) %>%
  ggplot(aes(x = films, y = avg_per_film/1000000)) + 
  geom_point(size = 1, color = "blue") + 
  geom_text_repel(aes(label = director_name), size = 3, color = "blue") + 
  xlab("Number of Films") + ylab("Avg Profit $millions") + 
  ggtitle("Most Successful Directors")

Looking at the most successful directors, one can see that directors reach the top either through a single very successful film, such as Tim Miller, or through an array of successful, large-budget films, such as James Cameron.

Top 20 movies based on Profit

#Top 20 movies based on Profit
movie_metadata %>% 
  filter(title_year %in% c(2000:2016)) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(profit)) %>% 
  top_n(20, profit) %>%
  ggplot(aes(x=budget/1000000, y=profit/1000000)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Profit $million") + 
  ggtitle("20 Most Profitable Movies")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

When assessing the top 20 movies based on profit, Avatar sits in the highest-profit region, mirroring director James Cameron's position in the previous chart.

Top 20 movies based on Return on Investment

#Top 20 movies based on Return on Investment
movie_metadata %>% 
  filter(budget >100000) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(return_on_investment_perc)) %>% 
  top_n(20, return_on_investment_perc) %>%
  ggplot(aes(x=budget/1000000, y=return_on_investment_perc)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Percent Return on Investment") + 
  ggtitle("Top 20 Movies by Return on Investment")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Successful directors such as George Lucas also have highly profitable movies.

Further Analysis

Commercial success Vs Critical acclaim

movie_metadata %>%
  mutate(profit = gross - budget) %>%
  top_n(15, profit) %>%
  ggplot(aes(x = imdb_score, y = gross/10^6, size = profit/10^6, color = content_rating)) + 
  geom_point() + 
  geom_hline(aes(yintercept = 600)) + 
  geom_vline(aes(xintercept = 7.75)) + 
  geom_text_repel(aes(label = movie_title), size = 4) +
  xlab("Imdb score") + 
  ylab("Gross money earned in million dollars") + 
  ggtitle("Commercial success Vs Critical acclaim") +
  ggplot2::annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross") +
  theme(plot.title = element_text(hjust = 0.5))

In the above graph we can compare commercial success (gross earnings) against critical acclaim (IMDB score), with content rating shown in colour; the highest-grossing films are again those of the successful directors identified earlier.

Correlation Heatmap

ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size = 2.8, size = 2, hjust = .85) +
  ggtitle("Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning in ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size
## = 2.8, : data in column(s) 'director_name', 'actor_2_name', 'actor_1_name',
## 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link',
## 'language', 'country', 'content_rating', 'genre', 'vote_bucket' are not
## numeric and were ignored

Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors.

According to the highest correlation value, 0.95, we find that actor_1_facebook_likes is highly correlated with cast_total_facebook_likes, and actor_2_facebook_likes and actor_3_facebook_likes are also correlated with the total. So we want to combine them into two variables: actor_1_facebook_likes and other_actors_facebook_likes.
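This modification can be sketched as follows (illustrative; it assumes the facebook-likes columns exist in movie_metadata and drops cast_total_facebook_likes once it is replaced):

```r
# Combine the supporting actors' likes into one variable and drop the
# now-redundant columns
movie_metadata$other_actors_facebook_likes <-
  movie_metadata$actor_2_facebook_likes + movie_metadata$actor_3_facebook_likes
movie_metadata <- movie_metadata[, !(names(movie_metadata) %in%
  c("actor_2_facebook_likes", "actor_3_facebook_likes",
    "cast_total_facebook_likes"))]
```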

There are high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews.

Sentiment Analysis

options(scipen = 999)
IMDB <- read.csv("../data/real_deal.csv")

#Removing Duplicates
IMDB <- IMDB[!duplicated(IMDB), ]

#Removing White-Spaces and Special Characters
IMDB$movie_title <- gsub("Â", "", as.character(factor(IMDB$movie_title)))
IMDB$movie_title <- str_trim(IMDB$movie_title, side = "right")

IMDB <- subset(IMDB, select = -c(genres))

colSums(sapply(IMDB, is.na))
##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        49                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       103                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross              actor_1_name 
##                       874                         0 
##               movie_title           num_voted_users 
##                         0                         0 
## cast_total_facebook_likes              actor_3_name 
##                         0                         0 
##      facenumber_in_poster             plot_keywords 
##                        13                         0 
##           movie_imdb_link      num_user_for_reviews 
##                         0                        21 
##                  language                   country 
##                         0                         0 
##            content_rating                    budget 
##                         0                       487 
##                title_year    actor_2_facebook_likes 
##                       107                        13 
##                imdb_score              aspect_ratio 
##                         0                       327 
##      movie_facebook_likes 
##                         0

We can infer a strong correlation between a movie’s Facebook likes and its IMDB score. This is expected, as a higher rating reflects higher viewer satisfaction, and so a stronger positive online presence is to be expected. Initially, this graph was constructed to see if there would be a difference between viewer enjoyment and movie rating. Movie databases are often criticized for the nature of their rating scales, which are driven by critics and prioritise sentiment and plot, and so may not fully coincide with viewer enjoyment. However, as seen below, this is not the case with IMDB’s scoring.

#Plotting Facebook likes against IMDB score
ggplot(data = IMDB, aes(x = imdb_score, y = movie_facebook_likes)) +
  geom_point() +
  stat_smooth(method="lm")
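The strength of this relationship can also be tested formally. A Spearman (rank-based) correlation is a reasonable choice here, since Facebook likes are heavily right-skewed; a sketch, assuming IMDB is loaded as above:

```r
# Rank-based association between IMDB score and Facebook likes;
# exact = FALSE avoids exact-p-value warnings when ties are present
cor.test(IMDB$imdb_score, IMDB$movie_facebook_likes,
         method = "spearman", exact = FALSE)
```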

#Plotting Aspect Ratio against Facebook likes
IMDB %>% 
  filter(aspect_ratio==1.85 | aspect_ratio==2.35) %>%
  mutate(aspect_ratio=as.character((aspect_ratio))) %>%
  ggplot() +
  geom_point(aes(x=aspect_ratio, y=movie_facebook_likes))

From the above graph, films with the wider 2.35 aspect ratio appear to attract more Facebook likes than those shot at 1.85, suggesting viewers may prefer wider formats.
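This impression can be checked with a simple two-sample comparison of likes across the two common aspect ratios (a sketch, assuming IMDB as above; given the skew in likes, a Wilcoxon test would be a robust alternative):

```r
# Welch t-test: Facebook likes for 2.35 vs 1.85 aspect-ratio films
wide   <- IMDB$movie_facebook_likes[IMDB$aspect_ratio == 2.35]
narrow <- IMDB$movie_facebook_likes[IMDB$aspect_ratio == 1.85]
t.test(wide, narrow)
```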

#Constructing Top 20
keywords_split <- str_split(IMDB$plot_keywords, pattern="[|]", n=5)

keywords_matrix <- do.call(rbind, strsplit(as.character(IMDB$plot_keywords), "[|]"))
## Warning in (function (..., deparse.level = 1) : number of columns of result
## is not a multiple of vector length (arg 99)
keywords_df <- as.data.frame(keywords_matrix)

names(keywords_df) <- c("one", "two", "three", "four", "five")

keywords_one_col <- gather(keywords_df) %>% 
  select(value)

keywords_one_col_freq <- keywords_one_col %>%
  group_by(value) %>%
  tally()

top_20 <- keywords_one_col_freq %>%
  select(value, n) %>%
  top_n(20)
## Selecting by n
movies_with_keywords <- data.frame()
IMDB_keyword_movie <- data.frame()
keywords_one_col %>%
  group_by(value) %>% 
  tally() %>% 
  filter(n > 30) %>% 
  ggplot() +
  geom_bar(aes(x = value, y=n), stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

top_20
## # A tibble: 20 x 2
##    value                  n
##    <chr>              <int>
##  1 alien                 82
##  2 boy                   72
##  3 death                132
##  4 dog                   58
##  5 drugs                 66
##  6 fbi                   71
##  7 female protagonist    58
##  8 friend               165
##  9 friendship            67
## 10 high school           89
## 11 island                57
## 12 love                 194
## 13 marriage              60
## 14 money                 61
## 15 murder               160
## 16 new york city         92
## 17 police               127
## 18 prison                62
## 19 revenge               70
## 20 school                73

Above we can see the most popular keywords describing the films in the dataset.

#Placing Top 20 Words Against Movie Success

IMDB_true_false <- IMDB 
for (keyword in top_20$value) {
  # str_detect already returns a logical vector, so no ifelse is needed
  IMDB_true_false <- cbind(IMDB_true_false, str_detect(IMDB$plot_keywords, keyword))
}

# IMDB has 27 columns after dropping genres, so the keyword flags start at column 28
reference <- 27
for (i in 1:20) {
  names(IMDB_true_false)[reference + i] <- top_20$value[i]  
}

truey <- data.frame()
for (keyword in top_20$value) {
  truey <- rbind(truey, IMDB_true_false %>%
                            filter(get(keyword) == TRUE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
    distinct(movie_title, .keep_all = T)
}



falsy <- data.frame()
for (keyword in top_20$value) {
  falsy <- rbind(falsy, IMDB_true_false %>%
                            filter(get(keyword) == FALSE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
    distinct(movie_title, .keep_all = T)
}
truey <- truey %>%
  mutate(tri = "Top 20 Word")

falsy <- falsy %>%
  mutate(tri = "NOT Top 20 Word")

truey_falsy <- full_join(falsy, truey, by = c("movie_title", "gross", "imdb_score", "movie_facebook_likes", "plot_keywords"))

truey_falsy <- truey_falsy %>%
  mutate(tri = coalesce(tri.y, tri.x)) %>%
  select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords, tri)

truey_falsy <-  truey_falsy %>%
  group_by(tri) %>%
  na.omit() %>%
  mutate(avg = mean(gross))

truey_falsy_graph <- summarise(truey_falsy, avg = mean(gross))
truey_falsy_graph %>%
  ggplot() +
  geom_bar(aes(x = tri, y=avg, fill=tri), stat="identity", position="stack") +
  theme(axis.text.x = element_blank()) 
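Beyond comparing the two bar heights, we can test whether mean gross genuinely differs between movies with and without a top-20 keyword (a sketch, assuming truey_falsy as constructed above):

```r
# Welch t-test of gross between the two keyword groups
t.test(gross ~ tri, data = truey_falsy)
```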

#Sentiment

keywords_from_split <- data.frame(lapply(keywords_split, "length<-", max(lengths(keywords_split))))

#Creating a list of titles and years separated by "|"
titles <-  IMDB %>%
  select(movie_title)
years <- IMDB %>%
  select(title_year)

titles_years_list <- list()

for (x in seq_len(nrow(IMDB))) {
  titles_years_list <- c(titles_years_list, paste(titles$movie_title[x], years$title_year[x], sep="|"))
}

#Renaming columns to movie names
movie_name_list <- IMDB$movie_title
names(keywords_from_split) <- titles_years_list

#Adding column to the beginning
key_names <- c('key_word_1', 'key_word_2', 'key_word_3', 'key_word_4', 'key_word_5')
keywords_from_split <- cbind(test_keys = key_names, keywords_from_split)

#Moving columns to rows
keywords_from_split <- melt(keywords_from_split, id = "test_keys")

#Creating new column for year
keywords_from_split_matrix <- do.call(rbind, strsplit(as.character(keywords_from_split$variable), "[|]"))
keywords_from_split <- cbind(keywords_from_split, keywords_from_split_matrix)
names(keywords_from_split) <- c("Key #", "Combined", "Keyword", "Title", "Year")
keywords_from_split = keywords_from_split %>%
  select(`Key #`, Title, Year, Keyword)

#Replacing blank cells within keywords to NA's
keywords_from_split$Keyword[keywords_from_split$Keyword == ""] <- NA
#Removing rows containing NA's
keywords_from_split <- na.omit(keywords_from_split)

#All unique years
all_years <- keywords_from_split %>% select(Year)
all_years <- distinct(all_years)
all_years <- na.omit(all_years)
all_years <- all_years %>%
  filter(Year != "NA")

#Function for sentiment per year
yearly_sentiment <- function(year, df) {
  amount <- nrow(df %>%
    select(Year) %>%
    filter(Year == year))
  df <- df %>%
    filter(Year == year)
  sentiments <- get_nrc_sentiment(as.character(df[4]))
  for (i in 1:length(sentiments)) {
    sentiments[i] <- sentiments[i]/amount
  }
  year_sentiment <- cbind(year, sentiments)
  return (year_sentiment)
}

sentiments <- data.frame()

#For-loop to capture all years
for (i in all_years$Year) {
  sentiments <- rbind(sentiments,yearly_sentiment(i, keywords_from_split))
}

#Making years integers
sentiments$year <- strtoi(sentiments$year)

#Sort by year
sentiments <- sentiments[with(sentiments, order(year)), ]

#Heatmap for sentiments
rnames <- sentiments[,1]
mat_sentiments <- data.matrix(sentiments[,2:ncol(sentiments)])
rownames(mat_sentiments) <- rnames
mat_sentiments <- t(mat_sentiments)

df_sentiment <-  as.data.frame(mat_sentiments)
names_emotions <- c("anger", "anticipation", "disgust","fear","joy","sadness","surprise","trust","negative","positive")

sentiments_graph <- cbind(names_emotions, df_sentiment)
#Run if there is a problem with heatmap
dev.off()
## null device 
##           1
#Heatmap
heatmap.2(mat_sentiments, Rowv=NA, Colv=NA, scale="row", col=colorRampPalette(c("white","darkblue")),  margins=c(5,10), trace = "none")
## Warning in heatmap.2(mat_sentiments, Rowv = NA, Colv = NA, scale = "row", :
## Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting row
## dendogram.
## Warning in heatmap.2(mat_sentiments, Rowv = NA, Colv = NA, scale = "row", :
## Discrepancy: Colv is FALSE, while dendrogram is `column'. Omitting column
## dendogram.
#Filtering sentiments graph
sentiments_graph_filter <- sentiments %>%
  filter(year >= 2000)

#Other graphs
# Showing trends between sentiment from the 2000's 
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = anger)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = anticipation)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = disgust)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = fear)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = joy)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = sadness)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = surprise)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = trust)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = negative)) + 
  geom_point(alpha = 0.5) + 
  geom_line()

ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = positive)) + 
  geom_point(alpha = 0.5) + 
  geom_line()
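The ten plots above can also be produced as a single faceted figure by reshaping the data to long format with melt (as used earlier in this report); a sketch, assuming sentiments_graph_filter as built above:

```r
library(reshape2)
library(ggplot2)

# One panel per emotion instead of ten separate plots
sentiments_long <- melt(sentiments_graph_filter, id = "year",
                        variable.name = "emotion", value.name = "score")
ggplot(sentiments_long, aes(x = year, y = score)) +
  geom_point(alpha = 0.5) +
  geom_line() +
  facet_wrap(~ emotion, ncol = 2, scales = "free_y")
```

Using `scales = "free_y"` lets each emotion keep its own vertical range, since the raw sentiment scores differ in magnitude across emotions.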


Network Analysis

Clarice, Daven, Lucia, Christopher and Indurain

September 6, 2019